
Here, we show the functionality of an SDForest and how you can use it to screen for causal parents of a continuous response in a large set of observed covariates, even in the presence of hidden confounding. We also provide methods to assess the functional partial dependence of the response on the causal parents.

We demonstrate this functionality with simulated data from a confounded model with two true causal parents of \(Y\); see simulate_data_nonlinear(). vignette("realWorld") shows an example using real data.

library(SDForest)
set.seed(42)
# simulation of confounded data
q <- 2    # dimension of confounding
p <- 350  # number of observed covariates
n <- 300  # number of observations
m <- 2    # number of causal parents in covariates

sim_data <- simulate_data_nonlinear(q = q, p = p, n = n, m = m)
X <- sim_data$X
Y <- sim_data$Y
train_data <- data.frame(X, Y)

# causal parents of y
f_1 <- apply(X, 1, function(x) f_four(x, sim_data$beta, sim_data$j[1]))
f_2 <- apply(X, 1, function(x) f_four(x, sim_data$beta, sim_data$j[2]))

# functional dependence on causal parents
par(mfrow = c(1, 2))
plot(x = X[, sim_data$j[1]], f_1, xlab = paste0('X_', sim_data$j[1]))
plot(x = X[, sim_data$j[2]], f_2, xlab = paste0('X_', sim_data$j[2]))

We first estimate the random forest using the spectral objective:

\[\hat{f} = \text{argmin}_{f' \in \mathcal{F}} \frac{||Q(\mathbf{Y} - f'(\mathbf{X}))||_2^2}{n}\] where \(Q\) is a spectral transformation (Ćevid, Bühlmann, and Meinshausen 2020; Ulmer?).
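One concrete choice of \(Q\) is the trim transform of Ćevid, Bühlmann, and Meinshausen (2020), which caps the singular values of \(\mathbf{X}\) at their median. The following is a minimal base-R sketch of that idea for intuition only; it is not the package's internal implementation:

```r
# Trim transform: cap the singular values of X at their median.
# Sketch for illustration, not SDForest's internal code.
trim_Q <- function(X) {
  sv <- svd(X)
  tau <- median(sv$d)
  # shrinkage factors: leave singular values below tau untouched,
  # scale larger ones down to tau
  shrink <- pmin(1, tau / sv$d)
  # Q = U diag(shrink) U^T, so QX has singular values min(d_i, tau)
  sv$u %*% diag(shrink) %*% t(sv$u)
}

n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
Q <- trim_Q(X)
QX <- Q %*% X

# largest singular value of QX equals the median singular value of X
max(svd(QX)$d) - median(svd(X)$d)  # close to zero (numerical error only)
```

Capping the leading singular values removes the strong low-rank signal that dense hidden confounders typically induce, while leaving the bulk of the spectrum, and hence the causal signal, largely intact.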

fit <- SDForest(Y ~ ., train_data)
fit
#> SDForest result
#> 
#> Number of trees:  100 
#> Number of covariates:  350 
#> OOB loss:  3.9 
#> OOB spectral loss:  0.08

Causal parents

Given the estimated causal function, the first question we might want to answer is which of the covariates are the causal parents of the response. If we intervene on the causal parents, we expect the response to change. For that, we can estimate the functional dependency of \(Y\) on \(X\), \(\widehat{f(X)}\), and examine which covariates are important in this function. We can directly compare the importance pattern of the deconfounded estimator to that of a classical random forest estimated by ranger. This comparison to the plain counterpart gives a sense of the strength of confounding: if no confounding exists, SDForest() and ranger::ranger() should result in similar models.

In the graph below, we plot the variable importance varImp() of the deconfounded random forest against that of the plain random forest. The absolute scale has no meaning, but we see that the two true causal parents, shown in red, receive clearly higher variable importance from the SDForest. The plain random forest cannot distinguish between spurious correlation and true causation.

# comparison to classical random forest
fit_ranger <- ranger::ranger(Y ~ ., train_data, importance = 'impurity')

# comparison of variable importance
imp_ranger <- fit_ranger$variable.importance
imp_sdf <- fit$var_importance
imp_col <- rep('black', length(imp_ranger))
imp_col[sim_data$j] <- 'red'

plot(imp_ranger, imp_sdf, col = imp_col, pch = 20,
     xlab = 'ranger', ylab = 'SDForest', 
     main = 'Variable Importance')
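Since the two importance scales are not comparable, ranks are a convenient scale-free way to check whether the true parents come out on top. The following base-R sketch uses small made-up importance vectors (hypothetical numbers, not the fitted values above) to illustrate the check:

```r
# Rank-based check on importance vectors (toy numbers for illustration)
imp_sdf <- c(0.1, 2.5, 0.2, 3.1, 0.05)  # hypothetical SDForest importances
imp_rf  <- c(1.0, 1.2, 1.5, 1.1, 0.9)   # hypothetical plain-RF importances
true_parents <- c(2, 4)                  # hypothetical indices of true parents

# covariates ranked by decreasing importance; keep the top m
top_sdf <- order(imp_sdf, decreasing = TRUE)[seq_along(true_parents)]
top_rf  <- order(imp_rf,  decreasing = TRUE)[seq_along(true_parents)]

all(true_parents %in% top_sdf)  # TRUE: both parents rank in the top m
all(true_parents %in% top_rf)   # FALSE: a spurious covariate outranks a parent
```

On real fits, the same check can be run with imp_sdf replaced by fit$var_importance and true_parents by sim_data$j.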

So far, we have looked at the variable importance of the non-regularized SDForest. We have two more techniques to better understand which variables might be causally important. The first is the regularization path regPath(), where we plot the variable importance against varying regularization. The plotly = TRUE option lets us visualize these paths interactively to better understand which covariates have robust importance in the model.

# check regularization path of variable importance
path <- regPath(fit)
plot(path, plotly = TRUE)
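Conceptually, a regularization path records each covariate's importance on a grid of regularization values; covariates whose importance survives strong regularization are the robust candidates. A toy base-R sketch of that idea, with made-up numbers rather than actual regPath() output:

```r
# Toy regularization path: importance of 3 covariates at 5 penalty levels
# (hypothetical numbers for illustration, not regPath() output)
cp   <- c(0, 0.01, 0.05, 0.1, 0.5)               # regularization grid
path <- rbind(X1 = c(3.0, 2.8, 2.5, 2.0, 1.2),   # robust: stays important
              X2 = c(1.5, 0.9, 0.3, 0.0, 0.0),   # spurious: vanishes quickly
              X3 = c(2.9, 2.6, 2.2, 1.8, 1.0))   # robust

# covariates whose importance survives the strongest regularization
robust <- rownames(path)[path[, length(cp)] > 0]
robust  # "X1" "X3"
```

A covariate like X2, whose importance drops to zero early on the path, is more likely an artifact of confounding or noise than a causal parent.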